Working with CSV Files

What You’ll Learn

  • How files work in Python: text vs binary, opening and closing
  • What CSV files are and why they dominate data science
  • Reading and writing CSVs with the csv module
  • Combining data from multiple CSV files
  • Using pathlib.Path for clean file path handling

Concept

What Are Files?

Everything on your computer is stored in files. For programming, there are two categories:

  • Text files contain human-readable characters (letters, numbers, punctuation). You can open them in any text editor. Examples: .txt, .csv, .py, .json.
  • Binary files contain raw bytes that only specific programs can interpret. Examples: .xlsx, .pdf, .jpg, .sqlite.
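As a quick sketch (the filename here is just a throwaway example), the same file can be opened in text mode, which gives you str objects, or in binary mode, which gives you raw bytes:

```python
# Throwaway demo file -- text mode reads str, binary mode reads bytes
with open('mode_demo.txt', 'w') as f:
    f.write('par 4')

with open('mode_demo.txt', 'r') as f:   # text mode (the default)
    text = f.read()

with open('mode_demo.txt', 'rb') as f:  # binary mode
    raw = f.read()

print(type(text))  # <class 'str'>
print(type(raw))   # <class 'bytes'>
```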

What Is CSV?

CSV stands for Comma-Separated Values. It is one of the simplest structured data formats:

player_id,name,handicap
1,Bear Woods,2.1
2,Brian Kolowitz,13.9
3,Sam Shanks,18.7
4,Bobby Bogey,25.3

  • The first row is typically the header (column names).
  • Each subsequent row is a record.
  • Values are separated by commas (the delimiter).

Why CSV Is Everywhere in Data Science

  1. Universal – every tool reads CSV: Python, R, Excel, databases, cloud services.
  2. Simple – no special software needed to create or inspect one.
  3. Lightweight – plain text with zero overhead.
  4. Version-control friendly – diffs are meaningful because it is just text.

A Note on File Paths

Our golf data lives in ../../data/ relative to this notebook. We will start with plain strings for paths, then switch to Python’s pathlib.Path at the end of this notebook. pathlib gives you an object-oriented way to build and manipulate file paths that works across operating systems.


Code

1. File I/O Basics with open()

Python’s built-in open() function returns a file object. The with statement ensures the file is automatically closed when the block ends, even if an error occurs.

# Reading a file
with open('../../data/players.csv', 'r') as f:
    contents = f.read()

print(contents)
# The file is closed automatically after the with block
print(f'File closed? {f.closed}')
# Reading line by line
with open('../../data/players.csv', 'r') as f:
    for line in f:
        print(repr(line))  # repr() shows the newline characters
# Writing a file
with open('sample_output.txt', 'w') as f:
    f.write('Round 1: Bear Woods scored 79\n')
    f.write('Round 2: Brian Kolowitz scored 97\n')

# Verify it wrote correctly
with open('sample_output.txt', 'r') as f:
    print(f.read())

2. Reading CSV Manually (and Why It Breaks)

The naive approach: split each line on commas.

with open('../../data/players.csv', 'r') as f:
    for line in f:
        fields = line.strip().split(',')
        print(fields)

This works for simple data, but it is fragile. Consider what happens when a value itself contains a comma:

course_id,name,location
1,"Bob O'Connor Golf Course, Schenley Park",Pittsburgh

A naive split(',') would break "Bob O'Connor Golf Course, Schenley Park" into two fields. The csv module handles quoting, escaping, and edge cases correctly.

# Demonstrating the problem
bad_line = '1,"Bob O\'Connor Golf Course, Schenley Park",Pittsburgh'
print('Naive split:', bad_line.split(','))
# Expected: ['1', "Bob O'Connor Golf Course, Schenley Park", 'Pittsburgh']
# Got:      ['1', '"Bob O\'Connor Golf Course', ' Schenley Park"', 'Pittsburgh']

3. The csv Module

Python’s standard library includes the csv module. Two main readers:

  • csv.reader – returns each row as a list of strings.
  • csv.DictReader – returns each row as a dictionary with header names as keys.

import csv
# csv.reader -- rows as lists
with open('../../data/players.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)  # pull out the header row
    print('Header:', header)
    print()
    for row in reader:
        print(row)
# csv.DictReader -- rows as dictionaries
with open('../../data/players.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
# DictReader makes field access readable
with open('../../data/players.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(f"{row['name']} has a handicap of {row['handicap']}")

Key difference: With csv.reader, you access fields by index (row[1]). With csv.DictReader, you access them by name (row['name']). DictReader is almost always preferable because the code is self-documenting and won’t break if columns are reordered.

4. Reading the Golf Data Files

Let’s explore the course’s golf datasets. We will build a reusable pattern: read a CSV into a list of dictionaries.

def read_csv(filepath):
    """Read a CSV file and return a list of dictionaries."""
    with open(filepath, 'r') as f:
        reader = csv.DictReader(f)
        return list(reader)
players = read_csv('../../data/players.csv')
print(f'{len(players)} players loaded')
for p in players:
    print(f"  {p['name']:20s} handicap: {p['handicap']}")
rounds = read_csv('../../data/rounds.csv')
print(f'{len(rounds)} rounds loaded')
print()
# Show the first 5 rounds
for r in rounds[:5]:
    print(r)
courses = read_csv('../../data/courses.csv')
print(f'{len(courses)} courses loaded')
for c in courses:
    print(f"  {c['name']:30s} {c['city']}, {c['state']}  (slope: {c['slope_rating']})")

Important: Notice that every value from csv.DictReader is a string. If you need to do math, you must convert types explicitly.

# All values are strings
print(type(rounds[0]['total_score']))  # <class 'str'>
print(type(rounds[0]['player_id']))    # <class 'str'>
# Calculate average score -- must convert to int first
scores = [int(r['total_score']) for r in rounds]
avg_score = sum(scores) / len(scores)
print(f'Average score across all rounds: {avg_score:.1f}')

5. Combining Data from Multiple CSVs

In a relational dataset, information is split across files and connected by shared keys (like player_id). This is a join. Let’s combine rounds.csv and players.csv to see player names alongside their scores.

# Build a lookup dictionary: player_id -> player name
player_lookup = {}
for p in players:
    player_lookup[p['player_id']] = p['name']

print(player_lookup)
# Join rounds with player names
for r in rounds:
    name = player_lookup[r['player_id']]
    print(f"{r['date']}  {name:20s}  score: {r['total_score']}  ({r['weather']})")
# Build a richer joined dataset as a list of dicts
course_lookup = {c['course_id']: c['name'] for c in courses}

joined_rounds = []
for r in rounds:
    joined_rounds.append({
        'date': r['date'],
        'player': player_lookup[r['player_id']],
        'course': course_lookup[r['course_id']],
        'score': int(r['total_score']),
        'weather': r['weather']
    })

# Show the first 5 joined records
for jr in joined_rounds[:5]:
    print(jr)
# Find each player's best (lowest) round
best_rounds = {}
for jr in joined_rounds:
    player = jr['player']
    if player not in best_rounds or jr['score'] < best_rounds[player]['score']:
        best_rounds[player] = jr

print('Best round per player:')
print('-' * 60)
for player, info in best_rounds.items():
    print(f"{player:20s}  {info['score']}  at {info['course']}  ({info['date']})")

6. Writing CSV Files

The csv module provides csv.writer and csv.DictWriter for creating CSV files.

# csv.writer -- write rows as lists
with open('sunny_rounds.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'player', 'course', 'score'])  # header
    for jr in joined_rounds:
        if jr['weather'] == 'sunny':
            writer.writerow([jr['date'], jr['player'], jr['course'], jr['score']])

# Verify output
with open('sunny_rounds.csv', 'r') as f:
    print(f.read())
# csv.DictWriter -- write rows as dictionaries
fieldnames = ['date', 'player', 'course', 'score', 'weather']

with open('all_rounds_joined.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(joined_rounds)

# Verify output
with open('all_rounds_joined.csv', 'r') as f:
    print(f.read())

Note the newline='' parameter when opening files for writing with the csv module. This prevents the writer from adding extra blank lines on Windows. It is a good habit to always include it.

7. Using pathlib.Path for File Paths

String concatenation for file paths is error-prone (forward slashes vs backslashes, missing separators). pathlib.Path solves this.

from pathlib import Path
# Build a path using the / operator
data_dir = Path('..') / '..' / 'data'
print(data_dir)
print(data_dir.resolve())  # absolute path
# List all CSV files in the data directory
for csv_file in sorted(data_dir.glob('*.csv')):
    print(f'{csv_file.name:20s}  {csv_file.stat().st_size:>8,} bytes')
# Use Path objects directly with open() and csv
players_path = data_dir / 'players.csv'

with open(players_path, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'])
# Useful Path methods
p = data_dir / 'rounds.csv'
print(f'exists:  {p.exists()}')
print(f'is_file: {p.is_file()}')
print(f'suffix:  {p.suffix}')
print(f'stem:    {p.stem}')
print(f'parent:  {p.parent}')
# Updated read_csv function using pathlib
def read_csv_path(filepath):
    """Read a CSV file and return a list of dictionaries. Accepts str or Path."""
    filepath = Path(filepath)
    if not filepath.exists():
        raise FileNotFoundError(f'No such file: {filepath}')
    with open(filepath, 'r') as f:
        return list(csv.DictReader(f))

# Use it
holes = read_csv_path(data_dir / 'holes.csv')
print(f'{len(holes)} hole records loaded')
print(f'First record: {holes[0]}')

AI

Exercise 1: Ask AI to Join and Analyze

Give an AI assistant this prompt:

Write a Python script that reads rounds.csv and players.csv from ../../data/, joins them by player_id, and prints each player’s best (lowest) round with the date and score.

What to look for in the AI’s response:

  • Does it use with open() so files are properly closed?
  • Does it use csv.DictReader or csv.reader?
  • Does it convert total_score to int before comparing?
  • Does it build a lookup dictionary for the join, or does it use a nested loop? (Both work, but the lookup is more efficient.)

# Paste the AI-generated code here and run it.
# Compare the output to our results from Section 5 above.

Exercise 2: Ask AI to Explain DictReader vs reader

Give an AI assistant this prompt:

Explain the difference between csv.reader and csv.DictReader in Python. When would you choose one over the other? Give a short code example of each reading the same file.

What to look for in the AI’s response:

  • Does it mention that reader returns lists and DictReader returns dictionaries?
  • Does it explain that DictReader uses the first row as keys by default?
  • Does it note that DictReader is better for readability and column-order independence?
  • Does it mention any performance difference? (There is a small overhead for DictReader, but it is negligible for most use cases.)

Exercise 3: Evaluate AI Code Quality

Give an AI assistant this prompt:

Write a Python script that reads all CSV files in ../../data/, prints each filename and its row count, and writes a summary to data_summary.csv.

Evaluate the response against these criteria:

  1. File paths – Does it use pathlib or string concatenation? Does it handle the relative path correctly?
  2. File closing – Does it use with statements for every file it opens?
  3. CSV writing – Does it include newline='' in the open() call for writing?
  4. Edge cases – What happens if the data directory does not exist? Does the code handle that?

# Paste the AI-generated code here and run it.
# Check: does the output match what you'd expect from our data files?

Summary

Topic                      Key Takeaway
Files                      Use with open() to ensure files are always closed properly
CSV format                 First row = headers, subsequent rows = records, comma-delimited
Manual parsing             split(',') breaks on quoted fields – do not use it for real data
csv.reader                 Returns rows as lists; access fields by index
csv.DictReader             Returns rows as dicts; access fields by name (preferred)
csv.writer / DictWriter    Write CSVs; always use newline='' when opening the file
Joining CSVs               Build a lookup dictionary on the shared key, then match records
pathlib.Path               Object-oriented file paths; use / operator to build paths
Type conversion            All CSV values are strings – convert with int(), float() as needed

Next up: Topic 03, Notebook 02 – Working with JSON files.
